Deepfake technology has advanced rapidly in recent years, creating highly realistic fake videos that can be difficult to distinguish from real ones. The rise of social media platforms and online forums has exacerbated the challenges of detecting misinformation and malicious content. Building on prior work in artificial intelligence, this research proposes a deep learning (DL)-based method for detecting deepfakes. The system comprises three components: preprocessing, detection, and prediction. Preprocessing includes frame extraction, face detection, alignment, and feature cropping. Convolutional neural networks (CNNs) are employed in the eye and nose feature detection phase, and a CNN combined with a vision transformer (CViT) is used for face detection. The prediction component employs a majority voting approach, merging the results of the three models applied to different features, each yielding an individual prediction. The model is trained on face images from the FaceForensics++ and DFDC datasets. Multiple performance metrics, including accuracy, precision, recall, and F1 score, are used to assess the proposed model's performance. The experimental results indicate the potential and strengths of the proposed approach: the CNN achieved an accuracy of 97% and the CViT-based model achieved 85% on the FaceForensics++ dataset, a significant improvement in deepfake detection over recent studies that affirms the potential of the suggested framework for detecting deepfakes on social media. This study contributes to a broader understanding of CNN-based DL methods for deepfake detection.
Introduction
Summary:
The text discusses various deep learning techniques used to detect fake videos, particularly deepfakes, which have become a significant challenge due to their realistic and misleading nature. Social media platforms like WhatsApp, Twitter, Facebook, and YouTube actively work to filter out such manipulated content to prevent misinformation and harm.
Deepfakes typically involve swapping a person’s face in a video with another’s using deep learning, leading to potentially harmful consequences such as defamation and financial misinformation. Detecting deepfakes is difficult due to the videos’ high diversity and complexity.
A common approach uses convolutional neural networks (CNNs) for feature extraction, combined with vision transformers (ViTs) for image classification. The proposed detection framework has three main stages, each illustrated by a code sketch below:
Preprocessing: Extract frames, detect and align faces, then crop eyes and nose regions.
Detection: Use CNN models focused on eye and nose features, and a combined CNN-ViT model for full-face detection.
Prediction: Apply majority voting across the three models to enhance accuracy.
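To make the preprocessing stage concrete, the following is a minimal Python sketch, assuming OpenCV for frame extraction and the MTCNN detector from facenet-pytorch for face and landmark detection; the frame-sampling rate and crop margins are illustrative assumptions, and face alignment is omitted for brevity.

```python
# Illustrative preprocessing: sample frames, detect the face, and crop the
# full-face, eye, and nose regions consumed by the three sub-models.
import cv2
from facenet_pytorch import MTCNN

mtcnn = MTCNN()  # default detector; we use the first detected face

def extract_frames(video_path, every_n=10):
    """Yield every n-th frame of a video as an RGB array."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        idx += 1
    cap.release()

def crop_regions(frame, margin=15):
    """Return (face, eyes, nose) crops, or None if no face is found."""
    boxes, _, landmarks = mtcnn.detect(frame, landmarks=True)
    if boxes is None:
        return None
    x1, y1, x2, y2 = boxes[0].astype(int)
    face = frame[max(y1, 0):y2, max(x1, 0):x2]
    # MTCNN landmark order: left eye, right eye, nose, mouth corners
    le, re, nose = landmarks[0][0], landmarks[0][1], landmarks[0][2]
    ex1 = int(min(le[0], re[0])) - margin
    ey1 = int(min(le[1], re[1])) - margin
    ex2 = int(max(le[0], re[0])) + margin
    ey2 = int(max(le[1], re[1])) + margin
    eyes = frame[max(ey1, 0):ey2, max(ex1, 0):ex2]
    nx, ny = int(nose[0]), int(nose[1])
    nose_crop = frame[max(ny - margin, 0):ny + margin,
                      max(nx - margin, 0):nx + margin]
    return face, eyes, nose_crop
```

Each crop would then be resized to the input resolution its sub-model expects before classification.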
The method leverages datasets like FaceForensics++ for training and validation. Although the combined model improves detection reliability by addressing the limitations of single-model approaches, it requires significant computational resources. Future work aims to reduce data requirements and explore additional facial features for improved detection.
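To illustrate the detection stage, below is a minimal PyTorch sketch of a hybrid CNN-ViT classifier in which a small convolutional stem produces feature-map tokens that are fed to a transformer encoder; the layer sizes, depth, and head count are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal CNN + vision-transformer hybrid for full-face classification.
# All hyperparameters here are illustrative, not the authors' settings.
import torch
import torch.nn as nn

class CViT(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, n_classes=2):
        super().__init__()
        # CNN stem: (3, 224, 224) input -> (dim, 14, 14) feature map
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # class token
        self.pos = nn.Parameter(torch.zeros(1, 14 * 14 + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                     # x: (B, 3, 224, 224)
        t = self.stem(x)                      # (B, dim, 14, 14)
        t = t.flatten(2).transpose(1, 2)      # (B, 196, dim) tokens
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], dim=1)
        t = self.encoder(t + self.pos)
        return self.head(t[:, 0])             # logits from the class token
```

A call such as CViT()(torch.randn(8, 3, 224, 224)) returns real/fake logits for a batch of face crops; the eye and nose sub-models would be plain CNN classifiers trained on their respective regions.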
The paper also reviews existing literature on deepfake detection datasets, techniques, and challenges, emphasizing the growing need for robust detection mechanisms as deepfake technology evolves.
Conclusion
In this study, we introduced a novel method for deepfake detection that fuses distinct facial features and draws on a comprehensive dataset refined through meticulous preprocessing. Our strategy entailed the development of a composite model integrating three sub-models, each specializing in recognizing deepfakes by analyzing a specific facial element: the entire face, the eyes, or the nose. Tailored data processing for each sub-model further strengthens this multifaceted approach, circumventing the constraints typically encountered in single-algorithm detection methods. Our training regimen utilized an expansive array of facial images from large-scale datasets such as FaceForensics++, which was pivotal in refining our model's ability to discern physical anomalies indicative of deepfakes. The empirical evidence from our tests revealed a significant enhancement in accuracy and efficiency over existing deepfake detection methods, demonstrating the advantages of our approach. A standout feature of our method is its robust performance across diverse scenarios, encompassing various environmental conditions and facial orientations, illustrating its practical applicability in real-world settings. This adaptability underscores our model's ability to identify deepfakes with high physical fidelity, an essential attribute in the current digital era. The implications of our work are far-reaching, addressing the pressing demand for reliable deepfake detection to thwart the proliferation of misinformation and other harmful digital content. Applying our approach has the potential to safeguard individuals, organizations, and society at large from the adverse impacts of deepfakes, thereby contributing significantly to digital security and integrity. Although our results are promising, we recognize the scope for further enhancement. Future research could integrate additional facial features or employ alternative datasets to augment the physical accuracy and operational efficiency of deepfake detection. Such advancements will fortify our method's effectiveness and contribute to the broader field of digital media authenticity.
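As a final illustration of the prediction stage, here is a minimal sketch of the hard majority vote that fuses the three sub-model decisions; the function name and the binary-label convention (1 = fake, 0 = real) are assumptions for illustration.

```python
# Fuse the per-region decisions with a simple hard majority vote.
def majority_vote(face_pred, eyes_pred, nose_pred):
    """Each argument is a binary label: 1 = fake, 0 = real."""
    votes = face_pred + eyes_pred + nose_pred
    return int(votes >= 2)  # fake if at least two sub-models agree
```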